Abstract. The aim of this paper is to present and discuss the adequacy of the Minimum
Intelligent Signal Test (MIST) as an alternative to the Turing Test. MIST has been proposed by Chris
McKinstry as a better alternative to Turing's original idea. Two of the main claims about MIST are
that (1) MIST questions exploit commonsense knowledge and as a result are expected to be easy to
answer for human beings and difficult for computer programs; and that (2) the MIST design aims at
eliminating the problem of the role of judges in the test. To discuss these design assumptions we will
present Peter D. Turney's PMI-IR algorithm, which allows for MIST-type questions to be answered.
We will also present and discuss the results of our own study aimed at the judge problem for MIST.
Keywords: Turing test, Minimum Intelligent Signal Test, subcognitive questions, commonsense
knowledge, Artificial Intelligence
1. Introduction

In  his  seminal  paper  Alan  Turing  proposed  a  test  for  machines.1  A  machine  would  pass 
the  test  if  it  were  capable  of  having  a  convincing,  human-like  tele-typed  conversation 
with  a  human  judge  (the  parties  to  the  test  cannot  see  or  hear  one  another).  Since  then, 
the Turing test (hereafter TT) has been widely discussed by philosophers, psychologists,
computer scientists and cognitive scientists.2 Despite the fact that it was proposed
more than sixty years ago, TT is still considered a fruitful theoretical idea.3 It is worth
stressing that the TT idea also has practical applications: see, e.g., the Loebner contest4
or CAPTCHA systems.5

2 See e.g., Konar (2000); Harnish (2002).

The judge's perspective in TT is one of the central issues when we try to evaluate
this test setting,6 and it had already been noticed by Turing. His suggestion was that the
interrogator  should  be  a  person  who  is  not  an  expert  in  the  field  of  computing  machines.7 
Such  a  requirement  stemmed  from  the  fact  that  Turing  was  aware  that  the  beliefs  and 
knowledge  of  the  interrogator  might  play  an  important  role  in  the  running  of  the  test. 
The  judge  bias  is  often  pointed  out  as  one  of  the  main  drawbacks  of  the  Turing  test.  Ned 
Block,  for  example,  writes: 

[c]onstrued as a proposal about how to make the concept of intelligence precise,
there  is  a  gap  in  Turing's  proposal:  we  are  not  told  how  the  judge  is  to  be  chosen. 
A  judge  who  was  a  leading  authority  on  genuinely  intelligent  machines  might  know 
how  to  tell  them  apart  from  people.  For  example,  the  expert  may  know  that  current 
intelligent  machines  get  certain  problems  right  that  people  get  wrong.  [...]  A  stupid 
judge,  or  one  who  has  had  no  contact  with  technology,  might  think  that  a  radio  was 
intelligent. People who are naive about computers are amazingly easy to fool [...].8

This  issue  has  a  very  practical  dimension,  as  the  problem  of  selecting  judges  for 
TT  becomes  even  more  important  when  we  think  of  the  Loebner  contest  (hereafter  LC). 
LC  to  a  large  extent  may  be  treated  as  a  practical  realization  of  TT  and,  as  such,  it  reveals 
certain  problems  with  TT's  design.  The  analysis  of  transcripts  of  2009-2012  Loebner 
contest  editions  sheds  more  light  on  the  role  of  the  judge  in  LC:  "The  biggest  drawback 
of  LC  is  that  the  judge  knows  that  the  conversation  is  taking  place  with  a  human  and 
a  program,  and  the  task  is  only  to  decide  which  is  which.  That  makes  it  a  much  harder 
task  for  the  program.  It  is  not  enough  to  exhibit  intelligent  behaviors  and  hold  a  decent 
conversation  —  the  program  has  to  be  at  least  as  human-like  as  the  competing  human."9 

What is more, one may argue that judges will never have a "normal" conversation
in LC, because they are placed in a test-like environment in which their main aim is
to identify the contestants.

Several solutions to the judge bias problem may be found in the literature. They
range from Loebner's idea10 of employing journalists as judges to the concept of introducing
a kind of protocol for the Loebner contest (regulating the range of problems and
questions allowed for the contest).11 However, there are two concepts put forward to
modify the general TT setting which we find especially interesting and promising. One
of them is the Unsuspecting Turing Test (UTT) and the other one is the Minimum Intelligent
Signal Test (MIST). Both proposals draw on Turing's original idea, but change the way
the testing is performed, and as such they aim at eliminating the judge issue from the
picture. What is also appealing is that UTT and MIST are designed in a way that makes
it possible to test their main assumptions.

3 See Saygin et al. (2001); Shieber (2004); Epstein et al. (2009) or Lupkowski and Wisniewski (2011).

4 Loebner (2009).

5 Ahn et al. (2003).

6 See the discussion in Lupkowski (2010) and (2011).

7 Cf. Turing (1950): 442; Newman et al. (1952): 4.

8 Block (1995): 379.

9 Lupkowski and Rybacka (2016): 361.

10 Loebner (2009).

11 See Garner (2009) or Watt (2009).

The Unsuspecting Turing Test was proposed in a short 1994 paper by Michael
Mauldin.12 Mauldin used the TinyMud game (a text-based multiplayer role-playing game) and
introduced a bot (named ChatterBot) into the game. He observed that the bot was often
taken for a human player. As Mauldin writes: "The ChatterBot succeeds in the TinyMud
world because it is an unsuspecting Turing test, meaning that the players assume
everyone else playing is a person, and will give the ChatterBot the benefit of the doubt
until it makes a major gaffe."13

The Minimum Intelligent Signal Test was proposed by Chris McKinstry.14 His
idea is to set rules for TT that allow it to be performed automatically. McKinstry
claims that this would be possible if only yes/no questions were allowed in the test
and if an interrogator evaluated patterns of answers instead of single answers. The
core idea is to compare the pattern of answers that a machine gives to a set of questions
with the patterns that human beings give to the same questions.

In this paper we focus on MIST, and our aim is to evaluate
it as an alternative to TT. We start by introducing the MIST design and its core assumptions:
(1) that MIST questions should be easy for humans and difficult for machines and
(2) that MIST results should be easy to evaluate. We then discuss these
assumptions with respect to the PMI-IR algorithm proposed by Peter D. Turney, which
was designed to answer MIST-like questions. Finally, we present the results of our own
study aimed at the judge problem for MIST.

2.  The  Minimum  Intelligent  Signal  Test 

MIST was first described by McKinstry in a very short (two-page) paper "Minimum
Intelligent Signal Test: an Objective Turing Test."15 McKinstry claims that the main
problem with the TT setting is that it gives us a binary answer16 when it comes to machines'
intelligence:


12  See  Mauldin  (1994)  and  discussion  in  Mauldin  (2009). 

13 Mauldin (1994): 17. At this point it may be noticed that TT and its alternatives (like UTT and MIST
mentioned here) put stress on the artificial agent's performance. As was pointed out by an anonymous
referee, it would be beneficial to consider the ability to make mistakes itself as the criterion
of mentality. This idea is explored, e.g., within the theory of minds as semiotic systems: see Fetzer
(1995), (1997).

14  McKinstry  (1997),  (2009). 

15  McKinstry  (1997).  More  extensive  description  of  MIST  may  be  found  in  McKinstry  (2009). 

16  It  is  worth  mentioning  that  Turing's  idea  is  not  that  simplistic.  He  assumed  that  an  agent  should 
be tested long enough to gain more reliable results. As is clearly stated in "Can Digital Computers
Think": "We had better suppose that each jury has to judge quite a number of times, and that sometimes
they really are dealing with a man and not a machine. That will prevent them saying 'It must
be  a  machine'  every  time  without  proper  consideration".  Newman  et  al.  (1952):  5;  see  also  Turing 
(1950):  442. 



Pawel Lupkowski, Patrycja Jurowska ◦ The Minimum Intelligent Signal Test (MIST)...


The 'all-or-nothing' nature of the Turing Test makes it of no use in the creation or
measurement of emerging intelligent systems – it can only tell us if we have an intelligent
system after the fact. What we really need is a Turing-like test that admits
degrees and treats intelligence as at least a human continuum – a test that would
allow us to measure the minimum amounts of global human intelligence that are the
precursors of full adult human intelligence – a test that can be easily automated so
it can be executed at machine speeds.17

To achieve such a goal, McKinstry proposes a test in which only yes/no questions
are allowed. This ensures that the tested agent will not have the opportunity to provide
misleading or evasive answers (as is often visible in LC transcripts): it has to provide
a simple "yes" or "no" to a given question.18 Such a setting allows for automating
both the running of the test and the evaluation of the answers provided. This should,
in theory, eliminate the judge's bias. McKinstry claims that the evaluation of MIST boils
down to a simple comparison of the answers provided by a tested agent with those provided
to the same questions by human participants.

As for the content of questions in MIST, McKinstry writes that they should address
our commonsense knowledge about the world, e.g., "Do you exist?", "Are you
a rock?", "Are you a human being?".19 He also claims that the subcognitive questions
proposed by Robert French would be a perfect inspiration for MIST questions. As French
puts it:

Surely,  we  would  not  want  to  limit  a  Turing  Test  to  questions  like  'What  is  the  capital 
of  France?'  or  'How  many  sides  does  a  triangle  have?.'  If  we  admit  that  intelligence 
in  general  must  have  something  to  do  with  categorization,  analogy  making,  and  so 
on,  we  will  of  course  want  to  ask  questions  that  test  these  capacities.  But  these  are 
the  very  questions  that  will  allow  us,  unfailingly,  to  unmask  the  computer.20 

Subcognitive questions should be designed to reveal low-level cognitive structures,
that is "the subconscious associative network in human minds that consists of
highly  overlapping  activatable  representations  of  experience."21  This  assumption  makes 
such  questions  difficult  for  machines,  as  they  require  acquiring  intelligence  about  the 
world  by  experiencing  it  in  the  way  human  beings  do  during  their  lifetime.  Examples 
of  such  questions  are  the  following: 

• On a scale of 0 (completely implausible) to 10 (completely plausible), please rate
'Flugly' as the name a child might give its favorite teddy bear.

• On a scale of 0 (completely implausible) to 10 (completely plausible), please rate
banana splits as medicine.

• On a scale of 0 (completely implausible) to 10 (completely plausible), please rate
purses as weapons.

• Please rate the following smells (1 = very bad, 10 = very nice):

a) Newly cut grass;

b) Freshly baked bread;

c) A wet bath towel;

d) Ground pepper.22

17 McKinstry (2009): 286.

18 McKinstry (2009): 289.

19 See McKinstry (2009): 290.

20 French (1990): 63.

21 French (1990): 56-57.

McKinstry's idea was to create a database of questions of this kind, where each
question is connected with an answer. Such a pair is called a mindpixel (see examples of
such question-answer pairs presented in the Appendix of this paper). In order to collect
such data McKinstry started the MindPixel project, in which internet users could
contribute mindpixels to a large database (the project was active from 2000 to 2005). As
McKinstry put it in an online interview: "The first phase is a completely public, internet
based effort. All the data it will be collecting will come from average people, with
no specific training in AI or psychology."23 Such a corpus would then be used for MIST.
As for the MIST procedure, McKinstry describes it in the following manner.24

1. N items (i.e. yes/no questions) are generated. For all these items, humans
should be able to provide an answer (affirmative or negative). The items should be
distributed so that the expected response is positive for about 50% of them
and negative for the rest (this proportion is aimed at reducing the bias for
answering yes/no questions).25 At this stage, we also collect the answers from
human participants and as a result we obtain a large corpus of questions and
human-intelligence answers.

2. Items are presented, and responses recorded. Items should be presented in
a random order, and on subsequent re-trials the item order is re-randomized.

3. For each item a judge evaluates the item/response pair as either consistent
or inconsistent with human intelligence. McKinstry claims that this grading
procedure may be easily automated, reducing the chance of a grading error
or an unforeseen bias.

4. Generate Score. The result is not "all or nothing" for a tested machine. We
obtain the percentage of the machine's answers that are evaluated as
consistent with human intelligence. This level should be more than 50%, which is
the score that answering at random would be expected to reach on such a balanced set of items.
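
The four steps of the procedure can be illustrated with a short sketch. This is only a minimal illustration under stated assumptions, not McKinstry's implementation: the function `mist_score`, the `always_yes` agent and the four-item toy corpus (items taken from the Appendix) are hypothetical names and data introduced here, and a direct comparison with the recorded human answers stands in for the automated judge.

```python
import random

def mist_score(agent_answers, human_answers):
    """Fraction of items on which the agent's yes/no answer matches the human one."""
    matches = sum(agent_answers[item] == expected
                  for item, expected in human_answers.items())
    return matches / len(human_answers)

# Step 1: a balanced toy corpus (half "yes", half "no"); True stands for "yes".
# The items come from the Appendix; the corpus size is an assumption for illustration.
corpus = {
    "Is Madonna a woman?": True,
    "Does Santa Claus deliver gifts on Easter?": False,
    "In general, do we need light to see?": True,
    "Is wood harder than diamond?": False,
}

# A hypothetical tested agent that ignores the question and always answers "yes".
def always_yes(question):
    return True

# Step 2: present the items in random order and record the responses.
items = list(corpus)
random.shuffle(items)
responses = {question: always_yes(question) for question in items}

# Steps 3-4: grade each item/response pair automatically and generate the score.
score = mist_score(responses, corpus)
print(f"{score:.0%} of answers consistent with the human answers")
```

Because the corpus is balanced, an agent that always answers "yes" scores exactly 50%, which shows why a result has to exceed this level before it can count as a sign of human-like answering.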

Summing up, the MIST setup should eliminate the judge's bias from the test results.
The evaluation stage would be unproblematic for judges, and it may even be automated.
What is more, due to the nature of its questions (addressing commonsense knowledge
contributed by non-experts) and the procedure by which they are collected, the questions should
be easy for human participants but difficult for machines (for the same reasons as those provided
by French for subcognitive questions).

Let  us  now  confront  these  assumptions,  first  with  the  PMI-IR  algorithm,  and  then 
with  the  results  of  the  practical  evaluation  of  MIST  results. 

22  See  French  (1990),  (2000). 

23  McKinstry  (2000). 

24  See  McKinstry  (1997),  (2009). 

25 See McKinstry et al. (2008).
3. The PMI-IR algorithm and MIST questions

The PMI-IR algorithm was proposed by Peter D. Turney in his paper "Answering Subcognitive
Turing Test Questions: A Reply to French."26 PMI-IR stands for Pointwise
Mutual Information (PMI) and Information Retrieval (IR). The algorithm measures the
semantic similarity between pairs of words or phrases. This involves issuing queries
to a search engine and applying statistical analysis to the results. As Turney states:
"[t]he power of the algorithm comes from its ability to exploit a huge collection of text."27
(The technical details of the algorithm are beyond the scope of this paper, but they
are explained in detail in Turney's papers cited above.)
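
Although the details are left to Turney's papers, the general idea of scoring word pairs from search-engine hit counts can be sketched as follows. This is a simplified illustration under stated assumptions, not a transcription of Turney's method: the ratio score below is the simplest form in which PMI-IR is usually described, the function names are hypothetical, and the hit counts are invented numbers standing in for real search-engine queries.

```python
import math

def pmi(hits_both, hits_a, hits_b, total_docs):
    """Pointwise mutual information estimated from document counts."""
    p_both = hits_both / total_docs
    p_a = hits_a / total_docs
    p_b = hits_b / total_docs
    return math.log2(p_both / (p_a * p_b))

def pmi_ir_score(hits_problem_with_choice, hits_choice):
    """Simplified ranking score: co-occurrence hits divided by hits for the choice.

    For a fixed problem word this ranks the choices in the same order as PMI,
    because the problem word's own frequency is the same for every choice.
    """
    return hits_problem_with_choice / hits_choice

def best_choice(cooccurrence_hits, choice_hits):
    """Return the choice with the highest simplified PMI-IR score."""
    return max(choice_hits,
               key=lambda c: pmi_ir_score(cooccurrence_hits[c], choice_hits[c]))

# Hypothetical hit counts for the problem word "big" (invented for illustration).
cooccurrence_hits = {"large": 8000, "small": 3000, "red": 500}
choice_hits = {"large": 100000, "small": 120000, "red": 90000}

print(best_choice(cooccurrence_hits, choice_hits))
```

In a real setting the two dictionaries would be filled by querying a search engine, which is where the "huge collection of text" mentioned by Turney enters the picture.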

The  PMI-IR  algorithm  has  been  tested  against  synonym  recognition  questions 
retrieved  from  two  standard  tests  for  English  learners:  TOEFL  and  ESL.28  The  PMI-IR 
overall  result  for  TOEFL  reached  73.75%  (for  80  questions)  and  for  ESL  74%  (for  50 
questions). 

Turney also used PMI-IR to generate answers to French's subcognitive questions.
When the algorithm was applied to the questions retrieved from French's paper29
it  was  able  to  reproduce  the  expected  results.  Let  us  consider  one  example  here,  namely 
the  Flugly  question. 

On  a  scale  of  1  (awful)  to  10  (excellent),  please  rate: 

• How good is the name Flugly for a glamorous Hollywood actress?

•  How  good  is  the  name  Flugly  for  an  accountant  in  a  W.C.  Fields  movie? 

•  How  good  is  the  name  Flugly  for  a  child's  teddy  bear? 

French  expects  the  following  results:  "most  people  would  agree  that  Flugly  would 
be  a  downright  awful  name  for  a  sexy  actress,  a  good  name  for  a  character  in  a  W.C. 
Fields  movie,  and  a  perfectly  appropriate  name  for  a  child's  teddy  bear."30 

The  PMI-IR  algorithm  assigned  the  following  marks: 

• Flugly for a glamorous Hollywood actress = 1;

•  Flugly  for  an  accountant  in  a  W.C.  Fields  movie  =  2; 

•  Flugly  for  a  child's  teddy  bear  =  10. 

One may easily notice that they are intuitive and, what is more, in line with
French's predictions when it comes to the ranking of the names: actress < accountant
< bear (Turney points out that: "[p]erhaps French would give a higher score for Flugly
as an accountant, but an informal survey suggests that the above ratings are quite human-like"31).


26  Turney  (2001a).  The  algorithm  is  also  described  in  (Turney  2001b). 

27  Turney  (2001a). 

28  Turney  (2001a),  (2001b). 

29  French  (2000). 

30  French  (2000):  336. 

31  Turney  (2001a). 
Turney was able to repeat such results for other types of questions proposed by
French. He concludes the paper in the following way.

French  (1990,  2000)  has  argued  that  the  Turing  Test  is  too  strong,  because  a  machine 
could be intelligent, yet still fail the test. I agree with this general point, but I disagree
with the specific claim that an intelligent but disembodied machine cannot give
humanlike  answers  to  subcognitive  questions.  I  show  that  a  simple  approach  using 
statistical  analysis  of  a  large  collection  of  text  can  generate  seemingly  human-like 
answers  to  subcognitive  questions.32 

At this point it is also worth mentioning the IBM computer's success in Jeopardy!,
the American general-knowledge game show. In this show questions can be about
anything, and they often rely on complex wordplay. To make things more complicated,
the contestant has to supply the correct question to a given clue. A typical example
may be: "As an adjective, it means 'timely'; in the theatre, it's to supply an actor with a
line."33 The correct response is: "What does 'prompt' mean?". In 2011 an IBM computer
named Watson defeated two Jeopardy! champions, Ken Jennings and Brad Rutter.34 This
illustrates the abilities of a modern-day AI system in the question processing domain.

The results achieved with the use of PMI-IR and the aforementioned AI success
in Jeopardy! suggest that there are classes of questions which address commonsense
knowledge and which machines can clearly answer. This makes the first assumption
of MIST we consider here at least problematic: commonsense questions like the ones
recommended for MIST are easy to answer for humans as well as for modern computers.

4.  Judge's  perspective  in  MIST 

In  this  section  we  will  take  a  closer  look  at  the  MIST  assumption  stating  that  evaluating 
MIST  results  would  not  be  problematic  for  judges.  To  this  end  we  designed  an  online 
study  in  which  one  group  of  participants  played  the  role  of  a  judge  in  MIST  and  the 
other  group  simply  took  part  in  MIST  as  tested  agents.  We  present  the  details  below. 

4.1.  Methods  and  Procedure 

For our study we used two questionnaires consisting of 50 questions retrieved from the
MIST project. As for the selection of questions, we eliminated those which contained
vulgarisms or serious grammatical errors that made them hard to understand
(e.g. "Was thomas nixon born in year?"). Apart from this, we did not apply any restrictions
to the selected questions. Below we present sample questions used in the study (associated
with the predefined answers retrieved from the MIST project; original spelling is
preserved). The complete list of questions may be found in the Appendix of this paper.


32  Turney  (2001a):  419. 

33  See  Dormehl  (2016):  137. 

34  Dormehl  (2016):  138. 




□  Is  Madonna  a  woman?  YES 

□  Does  Santa  Claus  deliver  gifts  on  Easter?  NO 

□  In  general,  do  we  need  light  to  see?  YES 

□  Is  a  cell  something  that  can  contain  either  a  nucleus  or  a  prisoner?  YES 

□  Are  most  cats  furry?  YES 

□  Does  wood  comes  from  trees?  YES 

□  Do  all  mammals  need  oxygen  to  live?  YES 


Both questionnaires were built with the use of the same set of MIST questions.
The first questionnaire (hereafter Q1) had the following instruction.

Your task in this study is to play the role of a judge who is evaluating answers given
by a computer program. These answers were given to simple yes/no questions.
Read  the  question  and  the  answer  provided  by  the  program.  Afterwards  evaluate  on 
a  scale  1  (I  strongly  agree)  to  5  (I  strongly  disagree)  the  degree  in  which  you  agree 
with  the  provided  answer.  If  you  do  not  agree  with  the  answer  or  it  is  in  some  sense 
problematic  for  you,  please  give  us  your  comment  in  the  field  'Comment.' 

After  this  instruction,  the  subject  was  presented  with  the  list  of  MIST  questions 
associated  with  answers,  scale  for  evaluating  answers,  and  a  "Comment"  text-field. 

In  the  second  questionnaire  (hereafter  Q2)  subjects  were  simply  participants  of 
MIST.  They  were  presented  with  a  list  of  questions  and  their  task  was  to  provide  answers 
("yes"  or  "no"). 

Both Q1 and Q2 ended with questions covering the age, gender and education
of subjects. The last question gave a rough estimate of computer-use
fluency: "Your web-browser started to display many commercials. This makes browsing
the internet very hard. What do you do?" Possible answers were: "a) I try to solve it on my
own" or "b) I try to find someone to solve this problem for me."

Both questionnaires were presented with the use of Google Forms.35 The study
was conducted online. Subjects were recruited with the use of social media and internet
forums (of a wide topical spectrum, e.g. Joemonster, Wykop) as we wanted to gather a
varied group of subjects. Attention was paid in the recruitment process
to ensure that no subject would fill in both questionnaires. For each address, only
one invitation for one questionnaire was sent.

Our main research goal was to check McKinstry's claim that MIST results would
be easy for judges to evaluate. A judge confronted with a MIST result should not have any problems
when evaluating the answers of subjects. As a measure of how difficult the evaluation
task is, we have chosen Fleiss' Kappa,36 which is a statistical measure of inter-rater
reliability. If this measure is high for the group of judges, we can assume that they evaluated
MIST answers with a high degree of agreement and consequently that the judgment
task was not problematic. Thus, our first research hypothesis (H1) is that for the group
of judges (Q1) we would observe a high level of agreement.

35  http://forms.google.com. 

36  Cf.  Carletta  (1996). 
We apply the same reasoning to the claim that MIST questions are easy to answer
for humans. Thus, our second hypothesis (H2) is that for the group of MIST participants (Q2) we will
also observe a high level of agreement. For the kappa interpretation, we use the values proposed
by Viera and Garrett.37
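
For readers unfamiliar with the measure, the following sketch shows how Fleiss' Kappa can be computed from a table of ratings. It is an illustration under stated assumptions and not the analysis script used in the study: the function names and the toy data are hypothetical, every item is assumed to be rated by the same number of raters, and the interpretation thresholds are the ones commonly cited from Viera and Garrett (2005), so the exact cut-offs should be treated as an assumption.

```python
def fleiss_kappa(counts):
    """Fleiss' Kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j (same number of raters for every item).

    Undefined (division by zero) when all ratings fall into a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Share of all ratings that went to each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_categories)]

    # Observed agreement for each item, then averaged over items.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]
    p_observed = sum(p_item) / n_items

    # Agreement expected by chance.
    p_expected = sum(p * p for p in p_cat)

    return (p_observed - p_expected) / (1 - p_expected)

def interpret_kappa(kappa):
    """Verbal label for a kappa value (thresholds assumed, see the note above)."""
    if kappa <= 0:
        return "Less than chance agreement"
    if kappa <= 0.20:
        return "Slight agreement"
    if kappa <= 0.40:
        return "Fair agreement"
    if kappa <= 0.60:
        return "Moderate agreement"
    if kappa <= 0.80:
        return "Substantial agreement"
    return "Almost perfect agreement"

# Toy data: 4 yes/no items, 5 raters each; columns are counts of "yes" and "no".
unanimous = [[5, 0], [0, 5], [5, 0], [0, 5]]
split = [[3, 2], [2, 3], [3, 2], [2, 3]]

print(fleiss_kappa(unanimous), interpret_kappa(fleiss_kappa(unanimous)))
print(fleiss_kappa(split), interpret_kappa(fleiss_kappa(split)))
```

For a yes/no questionnaire the table has two columns, while for ratings on a five-point scale it would have five; in both cases a value near 1 indicates that raters nearly always coincide and a value near 0 indicates agreement no better than chance.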

4.2.  Subjects 

The research group consisted of 263 subjects. 126 subjects filled out Q1, i.e., played
the role of a judge in MIST. The group consisted of 83% women and 17% men, aged
19-65 (mean=37.93, SD=13.98). The majority of the group had higher education (32%) or
were still studying (23%). 82% of subjects chose answer (a) to the question about
computer fluency, i.e., they declared that they would try to cope with the browser
problem on their own. 137 subjects filled out Q2, i.e., took part in MIST. The group
consisted of 40% women and 60% men, aged 12-67 (mean=28.81, SD=8.68). As in
the first group, the majority had higher education (53%). In this group, 93% of subjects
chose (a) as the answer to the question about computer fluency.

4.3.  Results 

For  data  analysis  we  used  R  statistical  software.38  In  Table  1  we  present  Fleiss'  Kappa 
measures  for  Q1  (MIST  judges)  and  Q2  (MIST  participants). 


Table 1. The study results: Fleiss' Kappa for Q1 and Q2. Fleiss' Kappa interpretation by Viera and Garrett

Questionnaire    N      Fleiss' Kappa    Kappa interpretation
Q1               126    0.05             Slight agreement
Q2               137    0.79             Substantial agreement

As may be noticed, the first hypothesis was not confirmed. MIST judges reached
only slight agreement (K=0.05) in their assessments of MIST answers. The result indicates
that the task of evaluating a MIST answer is not easy. Several judges confronted with one
question and a yes/no answer to this question may disagree in their evaluation of the answer.

As for the second hypothesis, it is confirmed. The agreement for MIST participants
was substantial (K=0.79). This suggests that the task of answering MIST questions is rather
simple and many subjects confronted with such a question will agree on the answer.

For a better understanding of judges' evaluations for Q1, we also asked our subjects
to provide additional explanations. Analysis of these explanations shows that simply
answering questions is not as problematic as evaluating answers. When confronted
with the latter task, subjects began to analyze the question itself and became more critical.
The effect is analogous to the one for the Turing test or the Loebner Contest. There are
two distinct tendencies of judges which may be observed in the collected explanations.


37  Viera  and  Garrett  (2005). 

38  R  Core  Team  (2013). 

39  Viera  and  Garrett  (2005). 




The first one refers to a lack of knowledge. Certain questions were too specialized or culturally
oriented, like "Has SETI discovered extraterrestrial life? [NO]"; "Is Idaho in Europe?
[NO]." For these question-answer pairs judges often commented "I do not know,"
"I would first have to check what SETI is." The second tendency is that when confronted
with a simple question, judges somehow do not believe that it is that simple. The effect
is that even for intuitive questions, like "Does 1 plus 1 equal 3? [NO]" or "Is Napoleon
dead? [YES]", subjects try to provide a context in which the given answer would not be
correct. E.g., for the question about Napoleon, sample comments were the following:
"Maybe there is someone else named Napoleon and this person is alive," "Napoleon is
alive in history," "My roommate's cat is called Napoleon, and it is all well." There were
also question-answer pairs which were commented on as highly controversial. Many
comments of the form "It depends" or "It is controversial" appeared. Examples of such
questions are the following: "Do people sometimes lie? [YES]"; "Do you need to get a
license to have children? [NO]"; "Is war better than peace? [NO]."

5.  Summary 

In  this  paper  we  focused  on  one  of  the  most  interesting  alternatives  to  the  Turing  test. 
McKinstry's  Minimum  Intelligent  Signal  Test  aims  at  providing  a  test  better  suited  for 
thinking  machines.  We  have  evaluated  two  assumptions  made  by  the  MIST  author: 
questions  in  MIST  should  be  easy  for  human  beings  and  difficult  for  machines,  and 
evaluation  of  MIST  results  should  be  non-problematic  for  judges.  When  it  comes  to  the 
first  assumption,  we  have  described  the  PMI-IR  algorithm  which  proved  to  be  able  to 
answer  subcognitive  questions  in  a  human-like  manner.  This  suggests  that  a  relatively 
simple  statistical  approach  is  effective  when  it  comes  to  MIST-type  questions.  On  the 
other  hand,  the  results  of  our  study  suggest  that  these  questions  are  in  fact  fairly  simple 
for human participants. What is more, the answers gathered for the second questionnaire
have a high level of agreement between subjects, which is in line with McKinstry's
predictions. 

Things are worse when it comes to the judge's role in MIST. The results of
our  study  indicate  that  the  task  of  evaluating  MIST  answers  as  human-like  may  be 
problematic.  Subjects  who  played  the  role  of  judges  in  our  study  were  far  from  reaching 
agreement  over  the  answers  provided  to  MIST  questions. 
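
To make the notion of judge agreement concrete, the following Python sketch computes a simple per-item agreement score: the fraction of judges who chose the most common verdict for an item. It is only an illustration. The function names, the verdict labels and the sample data are hypothetical, and this is not necessarily the measure used in our study.

```python
from collections import Counter


def item_agreement(verdicts):
    """Fraction of judges who gave the most common verdict for one item."""
    counts = Counter(verdicts)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(verdicts)


def mean_agreement(verdicts_by_item):
    """Average of the per-item agreement scores over all items."""
    scores = [item_agreement(v) for v in verdicts_by_item]
    return sum(scores) / len(scores)


# Hypothetical verdicts of four judges on three question-answer pairs.
sample = [
    ["human", "human", "human", "human"],      # full agreement
    ["human", "machine", "human", "machine"],  # judges split evenly
    ["machine", "machine", "human", "machine"],
]

if __name__ == "__main__":
    for verdicts in sample:
        print(verdicts, item_agreement(verdicts))
    print("mean agreement:", mean_agreement(sample))
```

With two possible verdicts the score ranges from 0.5 (an even split) to 1.0 (unanimity), so values close to 0.5 correspond to the kind of disagreement described above.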

Naturally, we are not claiming that the presented results are a conclusive argument
against MIST. Our aim was to evaluate the idea and consider its potential weak
points. MIST offers a well-defined framework for testing artificial agents. It does not
eliminate judge bias from the test, but the idea of automated evaluation of
MIST answers certainly reduces this issue. We also find the idea of using only yes/no questions,
and the justification for it provided by McKinstry, to be convincing, although this aspect
of MIST needs further study. In our opinion, MIST is still one of the best alternatives to
TT "on the market" and the most promising one when it comes to potential practical
applications. Certainly, the strongest points of the MIST project are the crowdsourcing
underlying its question-response (mindpixels) corpus and the well-operationalized idea
of the statistical evaluation of a tested agent.
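
The statistical evaluation of a tested agent can be illustrated with a short Python sketch. It assumes that each yes/no answer is scored as correct or incorrect against the corpus answer and asks how likely a given score would be under pure guessing (a binomial tail probability). The function name and the example figures are hypothetical; this is a sketch of the general idea, not McKinstry's exact scoring procedure.

```python
from math import comb


def chance_probability(correct, total, p=0.5):
    """Probability of getting at least `correct` answers right out of
    `total` yes/no questions when each answer is an independent guess
    that is right with probability `p`."""
    return sum(
        comb(total, k) * p**k * (1 - p) ** (total - k)
        for k in range(correct, total + 1)
    )


if __name__ == "__main__":
    # Hypothetical agent: 45 corpus-consistent answers on 50 questions.
    print(chance_probability(45, 50))
    # A coin-tossing agent typically scores around 25 out of 50.
    print(chance_probability(25, 50))
```

The smaller the resulting probability, the less plausible it is that the agent's answers were produced by coin tossing. No human judge is involved in this computation.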

Appendix: MIST questions used in the study


1.  Is  Madonna  a  woman?  YES 

2.  Is  Blackjack  a  card  game?  YES 

3.  Does  Santa  Claus  deliver  gifts  on  Easter?  NO 

4.  Has  SETI  discovered  extraterrestrial  life?  NO 

5.  Is  wood  harder  than  diamond?  NO 

6.  Do  some  people  find  genetic  engineering  to  be  frightening?  YES 

7.  In  general,  do  we  need  light  to  see?  YES 

8.  Is  a  cell  something  that  can  contain  either  a  nucleus  or  a  prisoner?  YES 

9.  Is  sun  black?  NO 

10.  Is  air  solid?  NO 

11.  Is  pizza  a  food  for  humans?  YES 

12.  Is  the  Milky  Way  a  galaxy?  YES 

13.  Does  1  plus  1  equal  3?  NO 

14.  Are  there  over  400  days  in  a  year?  NO 

15.  Does  one  times  five  equal  five  hundred?  NO 

16. Are most cats furry? YES

17.  Is  forward  the  opposite  of  backwards?  YES 

18.  Is  Napoleon  dead?  YES 

19. Is it right to take something that is not yours without permission from the owner? NO

20.  Is  Idaho  in  Europe?  NO 

21.  Is  this  sentence  in  Spanish?  NO 

22.  Is  Violet  a  color?  YES 

23.  Is  our  sun  the  only  star  in  space?  NO 

24.  When  you  throw  a  stone  in  the  air,  does  it  keep  going  up  forever?  NO 

25.  Does  PC  stand  for  "personal  computer"?  YES 

26.  Do  people  sometimes  lie?  YES 

27.  Do  humans  live  on  Mars?  NO 

28.  Are  whales  types  of  fish?  NO 

29.  Is  Greece  a  country?  YES 

30.  Is  the  earth  as  hot  as  the  sun?  NO 

31.  Is  winter  weather  warm?  NO 

32.  Was  Vincent  van  Gogh  a  painter?  YES 

33.  Is  night  darker  than  day?  YES 

34.  Is  war  better  than  peace?  NO 

35.  Is  wood  the  same  as  metal?  NO 

36.  Does  a  person  want  to  eat  when  he  is  hungry?  YES 

37.  Is  a  second  shorter  than  a  minute?  YES 

38.  Did  Germany  win  WWII?  NO 

39.  Is  there  a  maximum  number?  NO 

40.  Are  locks  more  useful  when  you  have  the  key?  YES 

41.  Is  toothpaste  a  better  alternative  than  sand  for  brushing  teeth?  YES 

42.  Do  you  need  to  get  a  license  to  have  children?  NO 

43.  Is  extraterrestrial  life  possible?  YES 

44.  Does  11  plus  11  equal  22?  YES 

45.  Does  a  week  consist  of  seven  days?  YES 

46.  Is  pregnancy  contagious?  NO 

47.  Is  it  safe  to  drive  a  car  whilst  drunk?  NO 

48.  Do  humans  regularly  eat  other  humans?  NO 

49. Does wood come from trees? YES

50.  Do  all  mammals  need  oxygen  to  live?  YES 


